context : allow cache-less context for embeddings #13108
base: master
Conversation
ggml-ci commits force-pushed from 58115a2 to 7e79a42
I'll work on rebasing and merging this next - it should be a good improvement for embedding models by reducing the allocated memory during inference.
@@ -49,7 +49,7 @@ static void batch_decode(llama_context * ctx, llama_batch & batch, float * outpu
         }
     } else if (!llama_model_has_encoder(model) && llama_model_has_decoder(model)) {
         // decoder-only model
-        if (llama_decode(ctx, batch) < 0) {
+        if (llama_encode(ctx, batch) < 0) {
Is this really right?
Not yet, needs a bit of work.
One of the main changes in this PR is that we will start using llama_encode() when computing embeddings, and this change is part of the tests I did.
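For context, here is a minimal sketch (not the exact code in the example) of how an embeddings path might route through llama_encode() for models without an encoder/decoder split; the helper name and error handling are illustrative:

```cpp
#include "llama.h"
#include <cstdio>

// Illustrative helper: run one prepared batch to produce embeddings.
// With this PR, the embeddings path goes through llama_encode(),
// which does not require a KV cache for embeddings-only use.
static bool embed_batch(llama_context * ctx, llama_batch & batch) {
    const llama_model * model = llama_get_model(ctx);

    if (!llama_model_has_encoder(model) && llama_model_has_decoder(model)) {
        // decoder-only model used for embeddings
        if (llama_encode(ctx, batch) < 0) {
            fprintf(stderr, "%s : failed to encode\n", __func__);
            return false;
        }
    }

    return true;
}
```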
target #12799
There is no need to create a KV cache when using embeddings-only models such as BERT.
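As a rough illustration of the intended usage, an embeddings-only context could be set up as below. This is a sketch, not code from the PR: the function names follow recent llama.cpp headers and may differ in older versions, and the model path is a placeholder.

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    // Load an embeddings-only model (e.g. a BERT-style GGUF); path is hypothetical.
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("bge-small-en.gguf", mparams);

    // Request embeddings output; with a cache-less context no KV cache memory is allocated.
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;
    cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN;  // pool token embeddings per sequence

    llama_context * ctx = llama_init_from_model(model, cparams);

    // ... tokenize, build a llama_batch, call llama_encode(), then read the
    //     pooled result with llama_get_embeddings_seq(ctx, seq_id) ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```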